Abstract

Diacritic restoration is a vital problem for languages that use diacritics in their orthography. It plays an important role in improving the performance of many NLP tasks. In this paper, we address Arabic text diacritization: our system diacritizes an input sequence of words both morphologically and syntactically. The system operates in three layers, each of which handles a specific problem. To evaluate the system, we use the benchmark LDC Arabic Treebank datasets used by the state-of-the-art systems, for the sake of fair comparison. We also use an extra test set to indicate how the system performs on a totally independent data set.

For morphological diacritization, we use both a Hidden Markov Model and an external morphological analyser to achieve high accuracy as well as high coverage. The morphological diacritization WER achieved by the system on the benchmark test set is 3.7%. We also introduce the use of Random Forest for syntactic diacritization and show that this simple, light classifier is effective enough to outperform much more powerful classifiers, achieving a syntactic WER of 8.3%. Finally, we provide a time analysis for each component of the proposed system.

Keywords: Arabic Text Diacritization, Natural 
Language Processing, Machine Learning. 


Nomenclature 

BAM A Buckwalter Arabic Morphological Analyzer 

CRF Conditional Random Fields 

DER Diacritic Error Rate 

DNN Deep Neural Network 

HMM Hidden Markov Model 

LDC Linguistic Data Consortium 

NLP Natural Language Processing 


OOV Out of Vocabulary 
POS Part of Speech 
SVM Support Vector Machine 
WER Word Error Rate 

1 Introduction 

The Arabic language belongs to a class of languages that use marks written above or below the letters, called diacritics, to determine the exact pronunciation of words. Different diacritics on the same letters generally produce different words, possibly with very different meanings.

However, most Modern Standard Arabic text is written without any diacritics. The exceptions include important religious texts and texts aimed at beginning learners of Arabic. The lack of diacritics therefore introduces an additional source of ambiguity. Native Arabic speakers easily deduce most of the missing diacritics and infer the correct pronunciation and meaning of a word from its context. However, this is not an easy task for beginning learners of Arabic, nor for NLP applications that need their input words in diacritized form as a preprocessing step.

In [11], Maamouri et al. showed that NLP applications can benefit from diacritizing the input text as a preprocessing step. They measured the effect of diacritizing the input text before the main task, which was parsing. The results showed that parsing accuracy improves when the input text is diacritized. The improvement was not significant because parsing and diacritization use very similar linguistic features. However, according to Maamouri et al., the effect of diacritization on other NLP applications is expected to be significant. Indeed, a long list of NLP applications needs a reliable automatic diacritization system to achieve higher accuracy. Examples are Machine Translation, Text To Speech, Automatic Speech Recognition, and Word Sense Disambiguation.

Arabic orthography consists of 28 letters: 25 are consonants and the remaining three represent long vowels. In addition, 8 diacritics are used to indicate the exact pronunciation of the letters. Table 1 lists the diacritics and the pronunciation that results from attaching each of them to the Arabic consonant "ر" (/r/). Each Arabic letter can carry one or more diacritics. The only combination of diacritics that can be attached to a single letter is "Shadda" together with one of "Fatha", "Kasra", or "Dumma".


Table 1: Arabic diacritic set

Diacritic       Diacritized letter   Pronunciation
Fatha           رَ                    /r/ /a/
Kasra           رِ                    /r/ /i/
Dumma           رُ                    /r/ /u/
Tanween Fath    رً                    /r/ /an/
Tanween Kasr    رٍ                    /r/ /in/
Tanween Dumm    رٌ                    /r/ /un/
Shadda          رّ                    /r/ /r/
Sukon           رْ                    /r/


Arabic text diacritization can be divided into two types: morphological diacritization and syntactic diacritization. To be fully diacritized, each word has to be diacritized both morphologically and syntactically. Morphological diacritization is the process of restoring the diacritics that depend only on the word itself and its meaning. The morphological diacritics of a given word with a given meaning stay unchanged regardless of the context of the word. In effect, morphological diacritics distinguish a word from other words that have the same letters. Table 2 shows an example of how different morphological diacritics on the same letters give different words with different meanings.

Most native Arabic speakers predict the correct morphological diacritics very easily and with very high accuracy using the information provided by the context.


Table 2: Examples of words with the same letters but with different morphological diacritics.

Word     Part of Speech   Meaning
عَقْد     Noun             Contract
عِقْد     Noun             Necklace
عَقَدَ     Verb             Held
عُقِدَ     Verb             Was held


On the other hand, syntactic diacritics depend on the role of the word in the sentence. They can be inferred from the grammar of the Arabic language. However, the grammar controlling the syntactic diacritics consists of a large set of syntax rules. These rules have so many special cases that even native Arabic speakers face considerable difficulty in predicting the syntactic diacritics correctly. Table 3 shows different syntactic diacritics of the same morphologically diacritized word.

Table 3: Different Syntactic Diacritics of The Same Word

Sentence                      Role of the word
[Arabic example sentence]     Subject
[Arabic example sentence]     Object


Arabic text diacritization faces many challenges. First, the Arabic morphological system itself is very complex. Any Arabic word can generally be factorized into a prefix, a stem, and a postfix, but more than one prefix and more than one postfix can be attached to a single stem. For example, the Arabic word "فسيكفيكهم" is itself a complete sentence, which can be translated into English as: so he will suffice you against them.

The second challenge is the lack of a large, fully annotated corpus to be used for training. For example, the training corpus we use in this work, which is the benchmark training corpus, contains only 288,000 words. This number of words is not enough to train a system to perform such a sophisticated task with high accuracy.

The third challenge is that, for some Arabic words, several diacritizations are possible for the same word with exactly the same meaning and the same POS tag. For example, the Arabic translation of "the prophet" has two accepted diacritized forms, and so does the Arabic translation of "electricity".

Another challenge is that there is no single canonical way to write an Arabic sentence. The sentence can start with a word of almost any POS tag, with no restrictions. The subject can come before or after the verb. Furthermore, the subject or the verb can be omitted from the sentence altogether and left for the reader or audience to infer. Finally, besides all these challenges that are inherent in the nature of the Arabic language, some very common spelling mistakes appear even in formal documents; for example, letters in the sets (ا أ إ آ), (ه ة), and (ي ى) are commonly replaced, incorrectly, by other letters in the same set.

In this work, we extend our work in [12] by enhancing the operation of each of the system layers, introducing the use of the Random Forest classifier for syntactic diacritization, and embedding knowledge of the Arabic syntactic grammar to further improve the performance of the syntactic diacritization.

The rest of the paper is organized as follows. Section 2 presents the approaches proposed by other researchers for handling the Arabic text diacritization problem. Section 3 gives the details of our proposed system. Section 4 describes the experiments we conducted to test the performance of our system and to compare it against the state-of-the-art systems. Section 5 gives the time analysis. Finally, Section 6 gives the conclusion and suggestions for future work.

2 Related Work 

Researchers have used several techniques for restoring missing diacritics, including rule-based, morphology-based, statistical, and example-based techniques. Regardless of the variations in their details, these techniques fit into four main categories: (1) treating the problem as a statistical machine translation problem; (2) treating the problem as a sequence labelling problem; (3) factorizing words into their basic segments using morphological analysis; and (4) a hybrid of two or more of the approaches listed above.

The text diacritization problem can be regarded as a statistical Machine Translation problem in which the undiacritized text is the source language text and the diacritized text is the target language text. Elshafei et al. [7] proposed the use of an HMM to predict the most probable sequence of diacritized words. The main disadvantage of their system is its low coverage: it fails to diacritize words that have not been encountered in the training data. To address the problem of low coverage, Schlippe et al. [22] proposed a system that uses character-level n-grams for OOV words. At run time, every word is checked to determine whether it was encountered during the training phase. If so, the word-level n-gram model is used to determine the most probable diacritized word. Otherwise, the characters of the word are diacritized using the character-level n-gram model by choosing the most likely diacritized character given the neighboring characters.

Other researchers handled the problem as a sequence labelling problem in which every word is represented as a sequence of letters, and every letter may be tagged with any of the possible diacritics. Rashwan et al. (2014) [17] and Rashwan et al. (2015) [16] used a Deep Neural Network (DNN) based framework to restore the missing diacritics. They used a set of morphological, syntactic, and context features as input to the DNN. As a post-processing step, they introduced a Contention Sub-set Resolution network to enhance the performance of the syntactic diacritization. Although the deep neural network achieved very high accuracy compared to other proposed systems, the sequence labelling approach usually gives lower accuracy than other approaches. The exception is when the classifier is very powerful, like the DNN, but the cost is then large memory and time requirements.

Zitouni et al. [25] used an approach based on maximum entropy to restore the missing diacritics. They used lexical, segment-based, and POS features for the maximum entropy classifier to label the characters with the missing diacritics. Instead of labeling each character independently, the labeling is chosen such that the conditional probability of the sequence of labeled characters is maximized. A beam search algorithm was also employed to reduce the search space.

In their system, MADA, Habash and Rambow [10] used the Buckwalter Arabic Morphological Analyzer (BAMA) to produce all the possible analyses for each input word. Then, they used a set of ten trained SVM classifiers to predict the expected POS tag features of the correct analysis. Finally, they chose the analysis that best conforms to the predicted POS tag features.

The hybrid approach usually combines the machine translation approach with either the sequence labelling approach or the morphological analyser based approach to achieve high accuracy as well as high coverage. Rashwan et al. (2009) [15] introduced a two-layer system in which the first layer retrieves the diacritics of previously seen words and the second layer handles the out-of-vocabulary (OOV) words. In the first layer, the analyses of words that occurred during training are retrieved, and A* search is used to infer the most likely sequence of diacritized words. OOV words are then passed to a factorizing module, which consists of a morphological analyser and a POS tagger, to get their possible analyses.

Ananthakrishnan et al. [1] adopted both statistical and knowledge-based approaches to solve the diacritization problem. In the training phase, they built a statistical language model using the Treebank corpus. During diacritization, the system first uses a word-level trigram model to obtain the most probable diacritization. If the word is OOV, the system gets all possible morphological analyses of the word using the BAMA morphological analyzer, and then uses a character-level 4-gram model to rank the possible diacritizations of the word.

In [12], we introduced a hybrid technique that uses an HMM and a morphological analyzer to restore the morphological diacritics. Word feature vectors are then built from morphological and lexical features, and a previously trained CRF classifier is used to predict the syntactic diacritics.

3 Proposed Approach 

We propose an approach consisting of three layers, which extends the work we proposed in [12]. Each layer takes a step towards the complete diacritization of the input sequence of words. We first handle the morphological diacritization of the input sequence, and then the syntactic diacritization of the morphologically diacritized sequence. The first two layers tackle the morphological diacritization. The first layer morphologically diacritizes words that occurred during the training phase, using a first-order Hidden Markov Model (HMM). The second layer handles the morphological diacritization of the OOV words by making use of an external Arabic morphological analyser. The third layer then handles the syntactic diacritization of the morphologically diacritized sequence using the Random Forest classifier. Figure 1 shows the components of our approach and the interactions between them. The details of each of the three layers are given in the following subsections.



Figure 1: System Architecture


3.1 Morphological Diacritization of Previously Seen Words

Input words are fed into the first layer sentence by sentence. Each word in the input sentence is checked to determine whether it has been seen in the training data. If it has, the different diacritizations of this word, together with their unigrams, are retrieved from the training database. If the word has not occurred during training, we check whether a variant of it has. We generate variants by applying simple letter normalization, by removing parts of the word such as the prefix and postfix, and by adding some of the allowed prefixes and postfixes to the word; we then check whether any of these variants occurred during training. If one did, its diacritized forms are retrieved and the corresponding POS tags are formed by combining the POS tag of the retrieved word with the POS tag(s) of the prefix and/or the postfix. Otherwise, if neither the word nor any of its variants is found, the word is left unchanged and marked as OOV.

Then, to select the most probable sequence of morphologically diacritized words, we retrieve word bigrams from the database and apply a first-order HMM. To avoid the zero and undefined probabilities that result from OOV words, we use add-delta smoothing with δ = 0.05. Once the morphologically diacritized forms of the words are selected, POS tag bigrams are used to select the corresponding POS tags for the morphologically diacritized words. The flowchart describing the operation of layer 1 is shown in Figure 2.


Figure 2: Flowchart of layer 1
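The selection step of layer 1 can be illustrated with a minimal sketch of Viterbi decoding over candidate diacritized forms, using add-delta smoothed bigram probabilities. This is not the system's actual code: the function name `viterbi_bigram`, the count dictionaries, the start symbol `<s>`, the choice of `vocab_size` in the smoothing denominator, and the transliterated toy forms are all assumptions made for illustration; only the first-order model and the value 0.05 for delta come from the text.

```python
import math

def viterbi_bigram(candidates, unigrams, bigrams, vocab_size, delta=0.05):
    """Pick the most probable sequence of diacritized forms.

    candidates: one list of candidate diacritized forms per input word.
    unigrams / bigrams: raw counts collected from a training corpus.
    """
    def log_p(prev, cur):
        # Add-delta smoothed bigram probability P(cur | prev).
        num = bigrams.get((prev, cur), 0) + delta
        den = unigrams.get(prev, 0) + delta * vocab_size
        return math.log(num / den)

    # Each column maps a form to (best log-probability so far, back-pointer).
    best = [{form: (log_p("<s>", form), None) for form in candidates[0]}]
    for options in candidates[1:]:
        column = {}
        for cur in options:
            prev, score = max(
                ((p, s + log_p(p, cur)) for p, (s, _) in best[-1].items()),
                key=lambda pair: pair[1],
            )
            column[cur] = (score, prev)
        best.append(column)

    # Backtrack from the best final form.
    form = max(best[-1], key=lambda f: best[-1][f][0])
    path = [form]
    for column in reversed(best[1:]):
        form = column[form][1]
        path.append(form)
    return path[::-1]


# Hypothetical toy counts (Latin transliteration, not real corpus data).
toy_unigrams = {"<s>": 10, "Eaqod": 6, "Eiqod": 4, "jadiydN": 5}
toy_bigrams = {
    ("<s>", "Eaqod"): 3,
    ("<s>", "Eiqod"): 3,
    ("Eaqod", "jadiydN"): 4,
    ("Eiqod", "jadiydN"): 1,
}
print(viterbi_bigram([["Eaqod", "Eiqod"], ["jadiydN"]],
                     toy_unigrams, toy_bigrams, vocab_size=3))
```

Working in log space avoids numerical underflow on long sentences, and the smoothing keeps every transition probability strictly positive, so the logarithm is always defined.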


3.2 Morphological Diacritization of 
OOV Words 

The output of the first layer enters the second layer, whose purpose is to obtain the morphological diacritics of the OOV words. As shown in Figure 3, in this layer we use the Buckwalter Arabic Morphological Analyser (BAMA) to get the possible analyses of the input words. The output of the analyser is a list of possible analyses for the input word, sorted in descending order of frequency. Each analysis consists of the diacritized form of the word and its corresponding POS tag.

Words are fed into the morphological analyser to get their analyses. If the word was already diacritized in the first layer, we use the output of the morphological analyser only to extract two features: (1) whether the word can be syntactically diacritized with any of the "Tanween" diacritics, and (2) whether it is common to omit the syntactic diacritic of the word.

Artificial Intelligence and Machine Learning Journal, ISSN: 1687-4846, Vol. 16, No. 1, Delaware, USA, December 2016

On the other hand, if the word is OOV, the most probable analysis is chosen first, and then these features are extracted. To select the most probable analysis, we combine information gained during training about the expected POS tag with information provided by the morphological analyser about the relative frequency of the possible analyses. We use the following equation to calculate the probability of each analysis:

$$P(A_{ij}) = \lambda \, P(pos_{ij}, pos_{j-1}) + (1 - \lambda) \, \frac{1}{i+1} \, N \qquad (1)$$

where $A_{ij}$ is the $i$-th analysis in the output list of the morphological analyser for the word at position $j$, $pos_{ij}$ is the POS tag of the analysis $A_{ij}$, $pos_{j-1}$ is the selected POS tag of the word at position $j-1$, $\lambda$ is the linear interpolation parameter, $0 < \lambda < 1$, and $N$ is a normalisation factor, $N = 1 / \sum_{i} \frac{1}{i+1}$.

Once the analysis is selected, the corresponding diacritization and POS tag are assigned to the OOV word.
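The interpolation used to choose an analysis for an OOV word can be sketched as follows. This is an illustrative reading of the scoring rule rather than the authors' implementation: it assumes the rank index starts at zero, that the rank weights 1/(i+1) are normalised to sum to one, and that POS bigram probabilities are available in a dictionary keyed by (previous tag, current tag). The names `rank_weights` and `select_analysis`, the default interpolation value 0.5, and the toy analyses are hypothetical.

```python
def rank_weights(num_analyses):
    """Weights 1/(i+1) for the ranked analyses, normalised to sum to 1."""
    raw = [1.0 / (i + 1) for i in range(num_analyses)]
    norm = 1.0 / sum(raw)  # plays the role of the normalisation factor
    return [w * norm for w in raw]


def select_analysis(analyses, prev_pos, pos_bigram_prob, lam=0.5):
    """Return the best (form, pos) analysis and all interpolated scores.

    analyses: list of (diacritized_form, pos_tag), most frequent first.
    pos_bigram_prob: dict mapping (prev_pos, pos) to a probability.
    lam: interpolation weight, strictly between 0 and 1 (assumed value).
    """
    weights = rank_weights(len(analyses))
    scores = [
        lam * pos_bigram_prob.get((prev_pos, pos), 0.0) + (1 - lam) * w
        for (_, pos), w in zip(analyses, weights)
    ]
    best = max(range(len(analyses)), key=scores.__getitem__)
    return analyses[best], scores


# Hypothetical example: the analyser ranks a verb reading first, but the
# previous word is a preposition, which favours the noun reading.
toy_analyses = [("form_a", "PV"), ("form_b", "N")]
toy_pos_probs = {("PREP", "N"): 0.6, ("PREP", "PV"): 0.01}
chosen, toy_scores = select_analysis(toy_analyses, "PREP", toy_pos_probs, lam=0.8)
print(chosen)
```

With a large interpolation weight the POS context dominates the analyser's frequency ranking; with a small one the analyser's first analysis is almost always kept.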


3.3 The Syntactic Diacritization 

The task of the third layer is to predict the syntactic diacritics of the input words. The input to this layer is the morphologically diacritized words, annotated with their POS tags as predicted by the two previous layers. Before diacritizing a word syntactically, we first check its POS tag and use it to decide how the syntactic diacritization of this word should be handled. We classify every input word into one of three classes based on its annotated POS tag. The three classes are:

(1) Words annotated with noun, adjective, adverb, or number POS tags.
(2) Words annotated with the POS tag of a verb in the present tense form used for a singular subject.
(3) Words with any other POS tag, such as prepositions, conjunctions, pronouns, verbs in the past tense form, and verbs used for a plural subject.
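The three-way routing above can be sketched as a small function. The tag inventory is an assumption: the abbreviations N, PREP, PV, V, PL, and R_Pron follow those listed in the caption of Figure 5, while NP, ADJ, ADV, and NUM are hypothetical names for proper noun, adjective, adverb, and number tags. Composite tags are assumed to be joined with "+" or "_".

```python
# Tags treated as class 1 (assumed names; only "N" is confirmed by Figure 5).
NOMINAL_TAGS = {"N", "NP", "ADJ", "ADV", "NUM"}


def syntactic_class(pos_tag):
    """Map a (possibly composite) POS tag to class 1, 2, or 3."""
    parts = set(pos_tag.replace("_", "+").split("+"))
    if parts & NOMINAL_TAGS:
        return 1  # handled by the machine learning classifier
    if "V" in parts and "PL" not in parts:
        return 2  # present tense verb with a singular subject: simple rules
    return 3      # fixed diacritic, looked up directly


for tag in ["D+N+F_PL", "V", "PV", "PREP", "R_Pron"]:
    print(tag, syntactic_class(tag))
```

Splitting composite tags into parts lets a tag such as D+N+F_PL be routed by its noun component, regardless of the attached determiner and number markers.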

We make this classification because the rules for diacritizing words in the same class are broadly similar, but differ from the rules used for words in the other classes. Words in the first class can be syntactically diacritized with any diacritic except "Shadda", but the process of diacritizing them is governed by a large number of complex rules. Restoring the syntactic diacritics of words in this class is much more complicated than restoring those of words in the other classes. That is why we use a machine learning classifier to syntactically diacritize these words.

To predict the syntactic diacritics of words of class 1, we first need to build the feature vector. Besides the features extracted from the morphological analyzer in layer 2, some other features are extracted for each word to help predict its syntactic diacritic. We use a combination of morphological, lexical, POS tag, and context features to build the feature vector. Then, we pass the feature vector to a previously trained Random Forest classifier. The Random Forest we used consists of 30 decision trees; every decision tree votes for one of the candidate diacritics, and the final decision of the Random Forest as a whole is the diacritic with the highest number of votes. We used the implementation of the Random Forest provided in the open-source "ALGLIB" library [2].

Figure 3: Flowchart of layer 2
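The voting scheme described above can be sketched in a few lines. This is only an illustration of majority voting over decision trees, not the ALGLIB implementation used by the system: the trees below are hand-written toy functions, and the feature names (`prev_pos`, `has_definite_article`, `can_take_tanween`) and the rules inside the trees are hypothetical and do not encode real Arabic grammar.

```python
from collections import Counter


def forest_predict(trees, features):
    """Return the label that receives the most votes from the trees."""
    votes = Counter(tree(features) for tree in trees)
    return votes.most_common(1)[0][0]


# Three toy trees, each a short chain of if-then-else checks.
def tree_a(f):
    return "Kasra" if f["prev_pos"] == "PREP" else "Dumma"


def tree_b(f):
    if f["can_take_tanween"] and not f["has_definite_article"]:
        return "Tanween Kasr" if f["prev_pos"] == "PREP" else "Tanween Dumm"
    return "Kasra" if f["prev_pos"] == "PREP" else "Dumma"


def tree_c(f):
    return "Fatha" if f["prev_pos"] == "V" else "Kasra"


toy_features = {
    "prev_pos": "PREP",
    "has_definite_article": True,
    "can_take_tanween": False,
}
print(forest_predict([tree_a, tree_b, tree_c], toy_features))
```

In a trained forest each tree is learned from a bootstrap sample of the training data rather than written by hand; the sketch only shows how the individual votes are combined.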

We chose the Random Forest classifier for the syntactic diacritization task because it is a rule-based classifier in which each tree corresponds to a sequence of if-then-else checks. This conforms to the way in which people with a strong background in Arabic grammar predict the syntactic diacritics. People have the advantage of using semantics to help them predict the syntactic diacritics with high accuracy; our technique tries to mimic the way people perform this task, but with much more limited sources of information.

Words belonging to the second class can have any syntactic diacritic except "Shadda" and the three "Tanween" diacritics. A small set of easy rules is used to select the correct syntactic diacritic for these words. Thus, words belonging to this class can be syntactically diacritized using a simple rule-based classifier represented as a sequence of if-then-else statements. These statements are inferred directly from the grammar that controls the syntactic diacritization of these words.

The syntactic diacritization of words belonging to the third class is trivial. Each of these words has a fixed case-ending diacritic that does not change in any context, so the syntactic diacritic can simply be retrieved from the training database. The sequence of operations of layer 3 is summarized in Figure 4. An example of how the system works on an input sequence of words is shown in Figure 5.


Figure 4: Flowchart of layer 3


4 Experimental Results

4.1 Data

We conduct our experiments on LDC Arabic Treebank Part 3, which consists of about 600 articles from the Al-Nahar Lebanese news agency. The words of the data set are annotated with their morphological and syntactic diacritics as well as with their POS tags. We use the same training and test data sets used by the state-of-the-art systems so as to guarantee a fair comparison with them. The training dataset consists of 288K words representing articles from January 15, 2002 to October 15, 2002. The test dataset consists of about 52K words representing articles from October 15, 2002 to December 15, 2002. The training and test datasets were intentionally chosen to be chronologically non-overlapping so that the results model how the system will perform in the real world [20]. In the rest of the paper, we refer to this test set as TestSet1.

Besides this benchmark test set from the Arabic Treebank, we also tested our system on another test set, a collection of articles from "Aljazeera" annotated by the RDI company in Egypt. We tested our system on this test set because it is totally independent of the training data: the articles come from a different news agency, and there is a large chronological gap between this test data and the training data, as most of the articles in this test data were published in 2012. This test data consists of 63 articles with about 30k words and 6k punctuation marks. The articles cover different topics from the following categories: arts, economics, health, politics, sports, and varieties. We call this test set TestSet2.


4.2 Results on TestSet1

Table 4 and Table 5 compare our proposed approach with the state-of-the-art systems on TestSet1 in terms of the morphological diacritization WER and DER, the syntactic diacritization WER, and the complete diacritization WER.
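For readers unfamiliar with the two metrics, the following sketch shows one common way to compute them. It assumes that WER is the fraction of words with at least one wrongly restored diacritic and that DER is the fraction of letters whose diacritic is wrong; the exact counting conventions used in the compared systems (for example, how undiacritized letters or punctuation are treated) may differ. Each word is represented as a list of per-letter diacritic labels, and the function name `wer_der` and the toy labels are hypothetical.

```python
def wer_der(reference, hypothesis):
    """Compute (WER, DER) for aligned lists of words.

    Each word is a list of per-letter diacritic labels; reference and
    hypothesis are assumed to have the same words and letters.
    """
    word_errors = 0
    char_errors = 0
    total_chars = 0
    for ref_word, hyp_word in zip(reference, hypothesis):
        mismatches = sum(r != h for r, h in zip(ref_word, hyp_word))
        char_errors += mismatches
        total_chars += len(ref_word)
        if mismatches > 0:
            word_errors += 1
    return word_errors / len(reference), char_errors / total_chars


# Toy example: two three-letter words, one letter wrong in the second word.
toy_reference = [["a", "o", ""], ["i", "o", "u"]]
toy_hypothesis = [["a", "o", ""], ["a", "o", "u"]]
wer, der = wer_der(toy_reference, toy_hypothesis)
print(wer, der)
```

Because a single wrong diacritic makes the whole word count as an error, the WER is always at least as high as the DER on the same output.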

The comparison indicates that the approaches of both Rashwan (2014) and Rashwan (2011) outperform ours in morphological diacritization on TestSet1. However, our result is close to theirs: the difference in morphological WER is at most 0.7 percentage points. On the other hand, the comparison shows that our approach outperforms all the other techniques in terms of the syntactic WER, by at least 0.3 percentage points. As for the total WER, the work of Rashwan (2014) outperforms our result by only 0.4 percentage points. Figure 6 summarizes the results of our proposed approach in comparison with the other approaches on TestSet1.


Table 4: Comparison of the morphological diacritization WER and DER on TestSet1. The best reported results are marked with an asterisk.

Approach          Morphological WER   Morphological DER
Our Approach      3.7%                1.6%
Rashwan (2014)    3%*                 NA
Rashwan (2011)    3.1%                1.2%*
Zitouni           7.9%                2.5%
Habash            5.5%                2.2%


Table 5: Comparison of the syntactic and full diacritization WER on TestSet1. The best reported results are marked with an asterisk.

Approach          Syntactic WER   Total WER
Our Approach      8.3%*           12%
Rashwan (2014)    8.6%            11.6%*
Rashwan (2011)    9.11%           12.5%
Zitouni           10.1%           18%
Habash            9.4%            14.9%


Figure 5: An example of applying the system to an input sequence of words. The figure traces each word through the following stages: Input, Candidate Analyses for Previously Seen Words, Layer 1 Output, OOV Analyses, Extra Feature Values, Layer 2 Output, Class, Syntactic Diacritic, and Final Output. In this figure, POS tags are abbreviated as follows: N stands for noun, R_Pron for relative pronoun, PREP for preposition, D for determiner, F for feminine, PL for plural, PV for past tense verb, V for present tense verb, and Pass for passive.


Table 6: The results of the system on TestSet1 using the concept of case-ending diacritic

Excluding last character WER   3.9%
Case-ending WER                8.1%


Table 7: The results of the system on TestSet2

Morphological WER   6.3%
Syntactic WER       10.1%
Complete WER        16.4%

4.3 Syntactic Versus Case-Ending 

The syntactic diacritic is the diacritic assigned to the last character of the core of the word (the stem), which is not necessarily the last character of the word as a whole. The case-ending diacritic, on the other hand, is the diacritic assigned to the last character of the word as a whole. For example, a word that carries a postfix can have the syntactic diacritic "Fatha" while its case-ending diacritic is "Dumma". If the word doesn't contain any postfix, the syntactic diacritic and the case-ending diacritic are the same.

Figure 6: Graphical representation of the comparison between our system and the state-of-the-art systems on TestSet1. The chart shows the morphological WER, syntactic WER, and total WER for Our Work, DNN (Rashwan), Hybrid (Rashwan), Zitouni, and Habash.

The syntactic WER expresses the power of the syntactic diacritization module more accurately than the case-ending WER. The syntactic WER also matches the way people judge whether a word is syntactically diacritized correctly. The case-ending WER has no real meaning in practice.

However, we noticed that some researchers report the case-ending WER instead of the syntactic WER. This also affects the morphological WER, because the diacritic of a letter can be counted as a morphological diacritic or as a syntactic (or case-ending) diacritic, but not both. That is why we also provide the results of our system using the concept of the case-ending diacritic. The results for all word letters ignoring the diacritic of the last character, as well as the results of the case-ending diacritization, are reported in Table 6. The complete diacritization WER remains unchanged.

4.4 Results on TestSet2 

The results of applying our system to TestSet2 are reported in Table 7.

The results show that the system performs worse on TestSet2 than on TestSet1. We analysed the results and found that some words are counted as incorrect although a human would judge them correct. This is because some Arabic words have more than one possible diacritization with exactly the same meaning and POS tag. In TestSet2, 97 native Arabic words and 361 Arabic words of foreign origin exhibit this problem. We consulted Arabic dictionaries to confirm that the alternative diacritizations of the native Arabic words are correct. The words of foreign origin cannot be found in Arabic dictionaries, so we used our own knowledge to decide that their alternative diacritizations are correct. The results of the system on this test set, taking the equivalent word diacritizations into consideration, are reported in Table 8.

Although this problem wasn't significant on TestSet1, it became significant on this different test set because the articles in the extra test set were annotated by different human annotators with, possibly, different cultures or backgrounds.


Table 8: The results of the system on TestSet2 taking equivalent diacritizations into consideration

Morphological WER   5.1%
Syntactic WER       10%
Total WER           15.3%


5 Time Analysis 

One main advantage of our system is its very competitive time requirements, which make it a strong candidate for online operation. In the following subsections, we present the time requirements of both the training phase and the testing phase in terms of time complexity.

5.1 Offline Operation Time Requirements

During the offline operation, we build the system, cal- 
culate all system parameters, and train system mod- 
ules. Thus, the time complexity for training the whole 
system is the summation of time complexities of the 
three layers. 

The time complexity of training the module of the first layer, which uses a first-order HMM, is the time for calculating the unigram and bigram frequencies, which is $O(N \log N)$, where $N$ is the number of words in the training data. For the second layer, no further parameters need to be calculated. Finally, the time complexity of training the Random Forest in the third layer is $O(m \cdot k \cdot N \log N)$, where $m$ is the number of features and $k$ is the number of decision trees in the Random Forest.

Thus, the time complexity of training the whole system is $O(N \cdot m \cdot k \cdot \log N)$.

5.2 Online Operation Time Requirements

To calculate the total time complexity of the system in its online operation, we sum the time complexities of the three layer modules. The first layer uses a first-order HMM, which has a time complexity of $O(n \cdot |S|^2)$, where $n$ is the length of the input sentence and $|S|$ is the number of different diacritizations of each word that occurred during training. In the second layer, the main operation is calculating the probability of each analysis so as to select the most probable analysis of a word form. Calculating the probabilities of the $h$ analyses of a word takes $O(h)$ time, so the complexity of this module for an input sentence of length $n$ is $O(n \cdot h)$. Finally, the last module uses the Random Forest to get the most probable syntactic diacritic. Applying the Random Forest to a single word takes $O(k \cdot \log N)$, where $N$ is the number of training samples and $k$ is the number of decision trees in the Random Forest. Hence, the complexity of this module for an input sentence of length $n$ is $O(n \cdot k \cdot \log N)$. Thus, the time complexity of the online operation of the system as a whole is $O(n \cdot (|S|^2 + h + k \cdot \log N))$. However, since $|S|$, $h$, $k$, and $N$ are constants once the system has been built and trained, the overall complexity of the online operation becomes $O(n)$.
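As a rough illustration of the constant hidden in the third-layer term, the bound can be instantiated with figures reported in this paper: $k = 30$ trees (Section 3.3) and, as an assumed upper bound on the number of Random Forest training samples, $N = 288{,}000$, the size of the whole training corpus (the subset of class 1 words actually used for training is smaller). Taking the logarithm to base 2 assumes roughly balanced trees, which is an idealisation:

```latex
\[
k \cdot \log_2 N \;\approx\; 30 \times \log_2(288{,}000) \;\approx\; 30 \times 18.1 \;\approx\; 540
\]
```

Under these assumptions, classifying one word costs on the order of a few hundred node comparisons, independent of the sentence length $n$.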


6 Conclusion and Future Work 

In this work, we proposed a multi-layer technique for Arabic text diacritization. The technique predicts both the missing morphological diacritics and the missing syntactic diacritics with high accuracy. It uses a first-order HMM together with BAMA for morphological diacritization to achieve both high accuracy and high coverage. The morphological diacritization WER achieved by the system is 3.7%, which is competitive with the state-of-the-art techniques.

We introduced the use of the Random Forest classifier for syntactic diacritization and showed that this very light classifier achieves a low syntactic WER of just 8.3%. Our approach is much lighter than the best performing system, which uses Deep Neural Networks. Given the memory and time requirements of Deep Neural Networks, this makes our system a strong competitor for both online and offline operation.

As future work, we may use a context-sensitive morphological analysis component, with the aim of improving the accuracy on OOV words.